The problem: You've just completed an amplicon sequencing run on the 454 instrument, but because the sequences are longer than any you've previously generated on 454, you suspect that you may have sequenced through your reverse primers and into non-biological sequence (i.e., sequencing adapters).
The solution: You want to match your reverse PCR primer in each of the sequences, and remove that and all bases following it. To do this, you can use scikit-bio's global nucleotide aligner.
In [1]:
import numpy as np
from skbio.core.alignment.pairwise import global_pairwise_align_nucleotide
from skbio import DNA, SequenceCollection
First, we'll model some sequences. This is quick-and-dirty. Each sequences will contain some biological sequence (which is what we actually care about) with a mean/std length of 400/40, followed by one of four slightly different reverse primers (so representing a primer with 4-fold degeneracy), followed by some non-biological sequence with a mean/std length of 25/2. This is a reasonable representation of what we'd get off of the sequencing instrument: our reverse primer is somewhere in the sequence, but we don't know the exact start or end positions. (Note that I'm not modeling any sort of sequencing error here, and the biological sequence is random, which is not representative of what we'd have in an amplicon sequencing run.)
Follow the inline comments for descriptions of each step.
In [2]:
sequences = []
num_sequences = 50
mean_biological_sequence_length = 400
std_biological_sequence_length = 40
mean_nonbiologial_sequence_length = 25
mean_nonbiologial_sequence_length = 2
# imagine that we have four slightly different reverse primers
reverse_primers = ["ACCGTCGACCGTTAGGATT",
"ACCGTGGACCGTGAGGATT",
"ACCGTCGACCGTTAGGATT",
"ACCGTGGACCGTGAGGATT"]
for i in range(num_sequences):
# determine the length for the current biological sequence. if it's less than 1, make the length 0
biological_sequence_length = int(np.random.normal(mean_biological_sequence_length,
std_biological_sequence_length))
if biological_sequence_length < 1:
biological_sequence_length = 0
# generate a random sequence of that length
biological_sequence = ''.join(np.random.choice(list('ACGT'),biological_sequence_length))
# determine the length for the current non-biological sequence. if it's less than 1, make the length 0
non_biological_sequence_length = int(np.random.normal(mean_nonbiologial_sequence_length,
mean_nonbiologial_sequence_length))
if non_biological_sequence_length < 1:
non_biological_sequence_length = 0
# generate a random sequence of that length
non_biologial_sequence = ''.join(np.random.choice(list('ACGT'),
non_biological_sequence_length))
# choose one of the four reverse primers at random
reverse_primer = np.random.choice(reverse_primers)
# construct the observed sequence as the biological sequence, followed by the primer, followed by the
# non-biological sequence
observed_sequence = ''.join([biological_sequence, reverse_primer, non_biologial_sequence])
seq_id = "seq%d" % i
# append the result to the sequences list
sequences.append(DNA(observed_sequence, seq_id))
# construct a skbio.SequenceCollection containing all of the sequences just generated
sequences = SequenceCollection(sequences)
If we now want to view the sequences we just created, we can simply print the SequenceCollection
object, which gives us a fasta-formatted string representation.
In [3]:
print sequences
>seq0
TATTTCAGTATAAGTTAGCGACTGATGTCATAGTCTAGCGACGAGGACATCCACTCCGGATATATCGCTACGCAGGCGCCACACGACCGGGTCTTGCAGAAACGATGATACCCAATGAGAAGTGTCCAAGCACCGCGCGGACAGAATCCGCTACGAACTTCGCTCACCCGGCTATCTTGATTAACATCGTATGCTGATCCCGAAGGCTTTGAGCCCAGCCCTTTTATTCACGTATGGGAAGTGGTACCAGACTCTGATCATTCAAAAGGACCCAGGCACCGGGGGTGAGCTTTCACGCTATGTGAGGTCTGACATGCTGTTAGAGTAATGGCTAACCGAGGCCTCTTAGCCACTAGATAAGAGTATACCGTGGACCGTGAGGATTTAGGT
>seq1
TGAAAATCAGTATGGACGTCAGCTGCCACTGTCATTGAATTATAGACGTGCGATCCCTTCTCGGTTGGCAGCTAGGTTTTTTTTGAAGGTCGTTTACGGGAGCGTCGCGTTTCATCACGCCGGCCACTAAGGACGACACGGATCCACGTGTTGGTAAGCCATATGGCATATGTGTGTTATGTGGGCACTTTGGGTGTCTGTCCTGCATCTTAAGTCCAACTAAAGGTCTGCTAACTGTATTGAGGGGACATAGAATTGTGTCCCGGGCACAGCGTCACATCTGTCGGACGTCTACTGTAAGGATTATGCAGACCCACTCGACGCTGCGTGCGTCAGCACACAGTGGAGGGGCAAGAGCATGATTAGGGCCGAGTGTAAATATCGTTAACCCATGTCGCGTATGGGTCTTTGTCTGGCTATTTAACCGGTCCCGGGCGTTCGCGCCGGTCTGACCGTGGACCGTGAGGATTA
>seq2
GAGAGGGACGGAGTATTGTCAAATTACGTCGAATTACGCTATAACAGGCACAAATGCAGTCCGATTGCATGGTAACAATTTACACTTCTCAATTCCCCGGCGAGTTTCTTAACTATAACAACGCGGTTAAAGGTACCTGCTCGCCCATTGATGCGCATTGTATAAGCTACCTAGCGGACCCAGAATGTAAGAGTGCGTTCCGGGCCCCATATCCCGATTATTATCGTCCTCGCAGGCATGACGTCCGAAGTTCCCCGGCTACCTATGCCTCGGGCTGCGGTAGCCTTCGGAGCAATCACCGAACTCGAAGAGCAAGATGAAGAAACATGACCCACGCTGCTCAGAATCCGGGGAAGGTCGATTCCGGACCCCGGCGTATTCCAAGGGAATCAGATCATTAAGCAGAGGAAGAGCCGGGAGCGAATGTAGCCTTTGGTGTGGTTGGGACCATCCTGTCTTCACCGTGGACCGTGAGGATT
>seq3
GACGTGACCGCTACCTAACAGCCGTGGTCTAGGATGATCTCTACACGTCTGTGACTAGGATGCTACTAGGGTCAACAAATGCGTTTATGCGATGTGAACGTCATAGTGGAACGGTAATTACCTGATGGTACCGTAGTACAGTTATCCATTTCGGAGTCTAGGTGTAGAGCACACAATGGTTATCTAGGGTGTACGGCCATACCTTTAACTAAGTAGGAACGCGCCCATTGCGGTCCGGACGTGGGAACGAGAGGTCAAAAATTTAAGTAATCATAGGTCATGGAGCAATACCCAACTCAAGGGCAGTGAGCACTTTTGCGTCCCAAGTGCTCAATGTCCCGTTCTAGAGTACGCTGCAAGTCTATCAGAGCCGTGGGTGCGCTATAACGAACGACCGTCGACCGTTAGGATT
>seq4
GGGCGCAAGCCGTGCCCATTCGCGTCCCTCGTCACGGGGCGGTAATCTTGCTCAATCTCAGCAGCATTACATGCCCAATAAACGATCCCGCTGATCTGCTTGTCCCCACGACAGTACCCGAAATGTGTTGGGTCGATGACGTGGACAGTGCGGAATACAGATGCGGTTAGCGCTTTGTTTACAAACCCGATCAATAGTGATCGCTGTGTGGGGATCACTTACGATTTCACTCCAAATTGGTAGGGACGTATTTATTATACGGGGTCCTAGTCAGTCGCAGTGGGGGTGGACTTCAGCCGCTTATGACATGTGCACGCCCACGATTTGGGGGATAGCTGCGCCGTAGTGCGTTGGCCATGCCATAATTACGTTTTGAGCCAGGCCACCGTCGACCGTTAGGATTAGTA
>seq5
CGTCTTATTAGCAAACACCACGCAACCCTGAAAAGGTAAATCCGGCGTGCGACGGCCAGCCACTCCTGCAGCATGACGTTGCTCCAAGGTTAAAACGCGGCGCTGCACAGTGCTAAGCCCCGGCTTTGACCTTGTGGCGTCTACGTTATATTGTATCAAGAACCGTTCTCTGAGTTATTAGATAGGTGTGGAGACATGAACAAGTTGGGGTGAAATGGATGGTTTACCAAGGTGGCAAATAAGGATGGCGTGGAGACTTAATATTAGCAATCATCTCCGTAACCCGACAGAGCCCAGCCGAAGATTTGTTACGTTGTTACTGCGTGGCTAACCATGCACTACTTCCCTACGCTAAGAAAGAGTTAGCCCAAAAGAGCTATACGCTGTCTAGTAACGCCCGGGCTGCTGAAGTTGGCCTTATCAGTATCATTGATTTTATTACCGTGGACCGTGAGGATTGAT
>seq6
TCCACGTGCGTTCGGGCGTTGACGGCATGCCAAATTTGGCTCAGTGCTAGTTACACCTAGTCAAAACCCTCATGGTAGAAGTTATACCCTTTTATGAGCTTCTGCCGCTCTGTTCTAGGAGCCCCGGGGTCTATAGCCGCCAGTAATTGCGGATATGTCTTGCGCATAACGCAATATGGCTATTTCGCACGCGCCGGCGACCATGCGCCGTTCATAACATGGGGAGATGCACAAAACCTATACCTAAGACTACCATGATAAAAGGATAATCAGAGTGGGGGATCAGCCTAACTGCTGTGGAATCAATTCTTTTAACACCCAGAAGCATGCCACTGTAGGATGGCGTACGTCGCGATGAAACCGTGGACCGTGAGGATTGT
>seq7
TGACGTCCTAATATGGTTAGCGTAGCCACTGCATAGAACGAGACATTCACTACCTGGTAACATAGGATGCACTTTTTTTTGGGTGGCTCATATTCATCTCGCCATGTTGTCATCGTCCACCGACCCATTGGAAGTGACATAGAAGCATTACGCTGATAAGTTGTTAACGGTCGCAAGGCCGGACCGGACTATAAGCTGACTATTAATAAAGTAACATAGGTCATTAATCGATATTCATCCGATCTTGTAACATACCGATATAAATTAGAACGGATTGCCCTCCGCTGCTATTTCGTCTTGGGCCTGGTAAGCTTTATGGGTACTCGGCGGACCGCAATTACGACCTAACCAGCGTGCACGAGCAGTACCGTCGACCGTTAGGATTGCA
>seq8
GCCGAGAAGTTTGAGGCATCGGAGATAGTTTCACAGCAGCCGGGTCGCGATAGGTAACGAAAAGAACTTTTGGGTTACAGCAGTCTTTGAGGTAGGAGAGCAGGTACTGGCCGTTATTATGTCCGTAACCACCGTGGACGCTGTCATATACTGGATGCATCTCACTCTTAGGGAAACAAAGCCACCGCCCTCCTCTACGTAGAGAAGTGTTTGGGAGAATCCTGAGACCAGTAAGGGACATAAGACAATACGCGTGCACAGTCGTCCTGCTGTGTGAAACTTCCCGATATCTCGCCCTCGCGAGAAAGAGGTACAAAGCCATCCCGGTCAACGAGAGGGAGCTTCCCTGCAGACCCCCGAGACTTAGGAAACAAGACTTCGACGAAGGAGTAGGCTTTGCTGCGCGTTGAGCAGTCGCTACGCATTTCGACCGTCGACCGTTAGGATTT
>seq9
AGATTACAACTAGAGCTCCGAAGGCACAAACCCACGTTGGCGATATTACTTTTAATCCACTGGCCCAAACGCAAGCTACCCTTCTGATACCCTTCCGCGCGCCGACGCTGGTATTGCAGAGCGCGAGAGCATTTGATGATCCGGACGAGGTAGCGTATGAGTTGATGGTACGTCTTGCGACGAAGATGTACATCGAATTTGCCTGTTAAGTAACCGGCTGGGGCGTGCCAAGTGTTCCGACCCTCAGGTGAGTCTACGCACGGCCGTTGGTGCTACCTCAGAAAACCTGATGCCATCTCGGATCGACACTAGCCCCAGTAGCCAGCTTAAAGATAATCTTAGTTCCAATTAGGGTCCATGATCAAGAAGTCCCATTATAAGTATCCATTGCAACTTCTACCGTGGACCGTGAGGATTGGTGC
>seq10
CTGGTCTTACACTGTCTTGCGGATTCAGCTTGACCAGTCTTACGACGACTTCCCTTAATCCGTCCTATTTCACTTATGTACTGTTGAGACCGAGGCGAAAATAGGTGTCCAACTGAGGCCTGTCCAGTGGCAGACGGAGAATGTGACCGCCCCCTTGCCTGCTGTACACTAAAGATTCGATCGCCCGAAACAAGTGCATTCAATATCGCTACGGACATTGGAGTCGGAGAGGATCCGGAGGCGTATGGGGTATGGTACGTCCTCCCTGCGCAATTGAGCAAAGCGCCTGTAAAAAGACGTCTCTATAAGCCAGGTGGATAATCCACGGCAGTTAGTTTAACTTCACACCGTGGACCGTGAGGATTT
>seq11
AATGGCTGCGTCGGCATTACGCAGACATGGGTTACTATCACAATAGGTTCAAGCTTCCTTCGATAATATTGGTCGAATGCATGTCACCCCAAGCAGGGTGCACATCCTCTGATTATGTAGGTCACACGTACGTCTATGGTCCGGCAGTAATACTATTGCCTGTGTTAGAATACCTCTAACCCCGAGAGTCATAAGGCTCCCACTCGATGCCGATACTGCTCGGGACGAGATTTAGCATTTCTGATGTACCGATTACGAACAGAAAGTCAAGAACTAGTGACATACTATGCCTCTGTTACTACGAGGGAGATGCGTCCGGCACATTCGATGTATCGTATCAAGACCGTTGTTTGAGCTGAGGCGACCCTGAAAGATCCTCAACTAACATCCCATACAAATCCGGGTACAGCCTCAATCGTTGTTGCCAGACGTACAATACACACTTGAATGTATTAACCGTGGACCGTGAGGATT
>seq12
GCGTCGGTTACTGATCTGTTAGCCTCGTTGTTAATATGAGGAAGACGGGGACTTGCTGCTGCTCGAATTGGTTTGACGACAACCCATACCCTGCGGATCCAAGGTGCGTCCATACTCAGCTCTGGCCGGGGGACACAAACTTGATCTGCCCCGGAGATGTAGGAGTGGGCCGCGTGTTGCCGGGAAGTGACCTACACCCCTTGCCTCCGGCGCGTATCCACGGTCACGCTCGTGGCCGTCAAGTAGGGTATTTTTGGTCCCTTCGGATTCAGTGGCCACAGATGCCAAAACAGTGGGTGTGACTGAGACATCGCCGGCTCTCGTCTGACAGAGCGACGCACTACACTTTAGACTATCCCTGGCCTCCGGACGGTGCCACCGTTTGATAAATATAGGTCATTCTCGCCTACCGTGGACCGTGAGGATT
>seq13
CAACCCCTACGGCAATTCACACGCCGGGAGACGCACTCACGTCTTGGGGGTGAGGGAATCGCCTTTGCGCGACTTCTCGACGACAGGGCGCGAGCCTACGGATTACCGACGATGTCCAGGACCGAGTTAACGCAGCGGGCACACTAAATTGGGATTGGCTGCCCCTTGGGGAAGTGACGCCTGGTGCGTGGGAGCCGCTCGTCAAGGCCGCCGCGCTCATTGTTATCGGCACGGCGGAAATAGATAACCGCGAAATCTTGTTCCGCGTCCAATAAGGTTATCCTTCTCCCATCGGTGAACAGCTGACTTACTCTTACGCACTGGTAGTCACTTCGCTTTAACTACTTATAATAACAAGACATGGCCACCTTACTGTCGGACGCGGCCCATATTCTCGTCTCCTTAAAATTAACAGGACCGTGGACCGTGAGGATTTCA
>seq14
CATTAGATATAAATGCTCCCCCTTAAGTTCAGCTCCATAGCCCCAGGGAATCATTTCGGAGTGTGTCGAATGGATGTACAATCCGCACTAGGTGACTACGCTCGTAATCTACCGTGAATGGATCAGATATCCTATACGTATGTCGTAAAGAACATAACTTGTGGAGTCACGTCGTAGTTGGCGAATCGTCTCACTTGACGAACGAGTAATCTTGGGGAGCGGCAGGCCTACAGCACGGGCGAACCTTCATTCGCACCGCCAGCCCCTACTCACTACTGTTATGCCAGAACATTTTATGAGCCCTCCCTGCCGTGCACAAGGATACAGGTACATGGAACGTCTCCGTGGGTGTTTGCGAAATGACCTGTCTGTGAGCTATCACAGCGGCTATGAAAACATTGAACGCGGAGGTGCCAGTGGACCGTGGACCGTGAGGATTCA
>seq15
AAGGTGCAACACTCACTACAGTGGTTACTTTAAGACTAGACCTGGCGCCGCATCTCTTTGCATCTCGGGCATATTGTTTCCGGGTCGGCGGTATGCTCCGTATCCTACTGCCACTGTAACTTTTTGAGCAGTGTGCTCCAAACGAGCAGGGTCGATCTGACATTTAGTGCTCATCCCAGGATGTGCATATAGACAGGACCAACTGCCGGGTGACTATGAGCTAGGTGGAACTAACTCCACCACTCGCCAGATGGAACAGCTAACCCTTTAGTACTCTTGCTTGACTACAGCAAGGTCCATTTTCTAAGGTTTGGGTGCATCGCAAATGCCAATAGCTACGTGCCCTATAGCCACTTCCTACTAGTTTATTGAGTGGTTTGTCATGCACGGATATACCCAGTGTGTTCCCTCCTTACCTGCTGACCGTCGACCGTTAGGATT
>seq16
AGATTACTGTGCTCAAAGTGAAGTCTCTGAAACAAGTAAGAATTGGAGACAAGAATTAGGTTAGGGCGTTTTAGTCATGAAGGCAAGGTAGCACGAGAATCCGGCTCACGGCCCTCACCAACCTCACTAGAGCCGTGGGTTATCGTCGGTCTAGAGAAAATTATGACTCTTTCAGACGACTGTTCAGAGAAGCCGGCCAATCTATGTCATAGACTAATTGTTTTATCTTTCACGCTTAGGTGCGTACCAGCTCTGAGAACATAAAGGCACGAAACGGTCAAAAGCCTTAGTTGTTGCACTGCAAGTAAATTGACGATAGTGACCCCCGAGCTAAGATTACTACTCGCAAATACCATGGATCTGACTCCAAGGATAGTCAGATCCCCCCGGCCGCCTTCCGAGAAAATTATAGCATGATGGATACAGAGACCGTGGACCGTGAGGATTGT
>seq17
TCTGAGTGCCTTTGTCTTGAATGCGAAGTCTGGCAGCATTCAGGTGGTGGATCTCACCGCCATCGGTACTGGGCGTATTTCTACTTCATGCGCGTTTGTGGGGGTGCCCTTCACGCATAATTAGCACGTCCCGCCCATCGGACGAAATTAGCTCCTGACGGGCCATTCTGCCAGGTTCCTTGGAGCCTCGCACTCGAGACAGGGGTATTGCCTGCCTAGTTTGGAATCGTGTTGAATTATGTTTAGAACAACTCCCCGTGCCTGACGCTGGGAGGGCTGAAAATCTCCGTCTGCTAATTCAGTGCTTATCACGCACTGGCTTCGGCATTTCACGGGGGCAGCAATCATGGCCGGCGACGTTGTTACATCGCTACCTATTATTAGGGTGCAGTTTAGCTGCACGAATAAAACCGTCGACCGTTAGGATTCTA
>seq18
CCTCGCATTGTGATGCTCGCCCCGGACGCAGGCGAGCGGGTACCCTGAAATAGAAGCACAGACTCTCGCTTACTTATTTGCAAGCACGATCCTAGATAATTGGCACGTGTTTCGGTCAGGTTCTGTAGACAGAACGGTCGGGGTGGCTTGGGAGACCAGCCGACTATCGAGTAACAGTCAACTGAAGATTGTCCCCCCGGAACAGGGAATCCATTTAGTGGGTATGTGATCCAGACGTTTCGACTCCTATTCATGTTCCGACCCTGCTTGAAGTGCTAGGTCACGACGTATGACTATACGCATTCACCCGGACACTCGATGGGTCTATCGCTCGAGAACAAATTGAGTTGGCGGGATACGTGCCGAGCAGAAGCCCATACGATAGTTACTCGATGCTCCGATGCAGGTGCAACATACCGTCGACCGTTAGGATTCA
>seq19
TCTCTTCGTACTAATCCCTAACCATGCACCGGAAGTCATACGTAGCAAATGACCTTTCAGTGCCCGATTATCGGTAACGCATAACTTCGAGGTTGCCGGCATCCCAGGCGGACCGGCAAAACAAGAAACAGCTGCGTACTACCATTTTTACGTTCCGAGCGGCATGATGGTAGCCCTGTGGAAATACAGCCCCGGACGGACTCCTTAATACGTCATGATTAATCGCGCGGTTTCTCCGCCTCCTCGACTGGTCCTCAAGCCTATAATCCGCCGACTGGAAAGTACCGTACGCCAGCAACGTAGCCTGTGGAAAATGTTTAGGTCAGTCGAAACACGTAACCGTCGACCGTTAGGATTGC
>seq20
ACAGACCCTCCGCCCAGCGTAGCTAACCAGCAATTAAAGTTTAGAGCGAGTGGGTATCAGGTTAAATCGGAGGCGCTAAAGTAAACTAAGGGTCCCTACGAAGGCGTTGGGGATTCGTTAGACGAGAGTCGCTGACTGCGCATAAGGTCCATCCCATCTTGAGTGGGTACACGACAATAAAATTAAGTTGTGGCTATGGGACGCGGCTCAAGAATGAGTGTAACCGTAGATCGGGAAACTTTTTTAACACGTACTGGCACCGAGGTTCCTAGTAGTTGACTAGTGGTTGGTAGGGGGCAAAAGACGCGCAGAATTGATCGCGTTTAAATTTGACTACAGAACCGGAGGGAACGTTCAGGTGTGCGAGGAAATGACAGTTTGAGTTTATAAGCCATATCGCACGAACCGCTACCGTCGACCGTTAGGATT
>seq21
GATACACACTTGTCCACGTTCAATCACACAGCTCAGCGGAGATACAGTCAAGTAGCGCGACCTGATGCTTCTATTTACGCGGGTGACAATCGTCATCAGATCCGAACCTCCTGCACGGATCTTTCAGCAGCAGTCTATCCTGTCGACGGCTCTATGACAGGCCGAGCTTCATCCGTTGGTTCACTAGTACCGTATGGGGCTCAGTCTGCAACCACTCCACCACACTAATAACCTTGAGTTGCTGCATGGGGGGGGGTCGCATAGCATTAAGACCCTGCGCTCACGTTACAGCTAGAAGTTCTCTCGAATTGCGGCAAAGCGAAGCCACTCTGCTGCATCAACTAACCACCGTCGACCGTTAGGATT
>seq22
GTAATGCGAAGTAACGTCGATAACGTGCTGTTAGCCAGTGTTCGAACGGCGTAGGGTCTGGGAGGGAGCTGCGTTAAAAATGGAGTGGGTACATATCAATATTAGACTGGGCTTTAGAAACGCTGTCTCGGTATACGGGGAACGGCAGTGAAGATAGGTCAGAAATGGTGGACACTACGGGCCTACGAACTACCTTAGTACACTATACGGCGGGACCAAAAGGCCCTTTTTAAACAGAAAGCACGCGGCAGTGTGCATATTCTTTGGAAGCGCCCTAATGGCCGACATTTCTGCGCGACAAGCGTAAAGACTTACCCCAGACAGAAAGCTCAATTGCTAATAAGAGCGACTGTTGCCAAGGCCCTGCACGTATACGGCGTATGCTCGACCCTGGAATGTCGAAATTCTAATACCGTCGACCGTTAGGATTTAGA
>seq23
GATCAGTCAATCAATAGCATCAGAAGCTGTACCGGGACACATGGATATCTGATTAATGCCCCGTGCGGTCTTCTGTCAAGAGAAGTCCAACAGAACACCAAAATTGCCGTGCCACTGGCCCTAGGCGGCTAGTAGTAAGTACGAGAAGTCGATGACTCTCACTCACCGTGTTCCGGCAACATCATCCCTGAATTCCTATTATATGGACTCTCCATTGCACCATTCTTCAAAACCCTCGCCCACCCATATGGTCCTTAAGTGCGTCGCTGAACCGACCCAATTACACCTCTACTCAACGCAGTAACGTGCAGGTAACGGGTTAAATTTATATAGGAGGTCTAGAACGTAACCGTCGACCGTTAGGATTCAT
>seq24
ACAGAGAGACAATTATAGCAGAAGAGTTTTTTAACTTTGTGATGTGTGAATATCATGCCGGGAACGACTGGGGAGAGTTTGTACGAAAAGTGTAGGAGATAGGGAGTACCACATCTAGACTTCCCTTTGTGAGCCGTAGCGTCTTACTAGTGTCACCTTTCGCAATGCCTGTGCCACCTAAATATTCGCAAACTGGTGATAGCGGGATCAAATTCGTGTAGCATAGCTTCTATGATTCAACTGACACAATCTAATTCAGCCTGCACTCCATCAGGGGTTTCTTGCGCCCTGGGCAGGCGTTTACCGGGGAGTAGGGCGACGAAGATACAGCGCCAGCTGCTGTTTCGCGCATGAGGCACGCAGCCCGACACAATATAGAGCGATGCAATGAGTAGTGCGGTGAAACCGTCGACCGTTAGGATTCTC
>seq25
AGTGGATGGGACGCCGGATTCAGAGGAGGGTAAACCACGTGGCACAGCCATTAGAGGATCGGTAGCTTCGTCCACGTTCTTCAAACTAAGTACCATACAAATACACGTCATCCGCGGCTGTTGCGGCTAAGAGAAAACATTCTGCCCACATCGACACACACTAGGGAAACTGGGAAGTCCTTAAAACTACGTCTCAATGCGAGGCCCGTACCTCACGCTTCGAGCCTAATCGTCTAAGTAGTCGTATCAAAGTATGTCTAGTTCGGCACAAACCCTCGCAATCGCACGCCGTCAGTTCGGCCAGTGTGTTTTTCTCATATGCTGGGACTCTCCCCATTGTCAAAGGCTCAATATGTAGTAGAATCCCATTCCATGTCGCTGGGGCTTCTTGCGAAAGACGCGGTTGGCAGTGTGTTGGTACCGTGGACCGTGAGGATTCG
>seq26
CTATATGTTCTTCGAGAATTAAATTGCAACAACAGTCGCGCAGTCGTCGCCGGGAAGTTTTCGTTTACTTAGAATCTACAGGTATCTAGTAATCTGCAGGTATCCTACGTTTCTACCTTCCATCATCGGAATGCAAGAGCGTACAAGATCTTCCTACACCCTCCCTTAAGCTAGAATATAAAGCAGCACGTTAAGAAGATTCTAACACGGGTAAGAAGTGCAAGTCAGATCAGTGCCGACGGAGGTAGTTTTCGCAACAGGCCCACGCTATAAGGAAGCTTGGTCGCAGGACTACCACACATGGCTGTTGGAGAGCGGGTGTGGAAAGCACACACACCTCTCTGATAGGTCAAAAGGTTCACGGGGCGCGAAATTAATATTATTGAATATAGTCTACGAACCTAGAAACTTACCTAGCTACTTTCGACAGTCTTACAACCCGTATCTTTGTGTCAGGTTAAGCGTTGCTGACCGTCGACCGTTAGGATTC
>seq27
CATCAGTAATCCCATCAGCCGGGGCGGACGTCTGGAGATCCTATGCGCAGCTCTGAATCTATTTGTGGGAGAACATCACAGAGAGATGAACCGTTCCGGATGAGTCAAAAGGACGAGTTTCCGAAGGGATAATACCGGATATCGACTGGGGTCGGCGATGCCCCTAGACGACACACAAGCCAAATCGCGCTCACAGATATCAATAGATAGAGCGTTGTAACCTACATCTCTGATTTCAAATAATAAGACGAACTCCAAGCGCCAGTCATGGAATCTTTAGGCAGCATGCGTGCCACTTGTTAATGGATGCCACCCTGAACCCGCGGGGGGGGCGCGGGAGCCACTCATACCCATCGCTGTAGACTTAATGGAGAAATGATTTCGAACTAAGCACGACCCTCGTAGGTGCTGCGATTCAGGATAGTACCCATCTTTCCCTCATTGAACCGTCGACCGTTAGGATT
>seq28
TCTAAGGCCCCGGATCCTGCGAACCATAACCGTGGCCGACAGATTTATGGAAACCTCCTTCACGTAATGGGCCAGCGACTCCCATTACAATCGCGTAAATGAATGTGCCTTTCTAAACCCGATCGCAGTTACAGTCCCATATATACTCCCAATTGGAATCGGAGCACTGACCTTGTTTTACGGTCCGGAGGTATCAATCTATCCTCGATTATTCCCGCACCCCCATAGCAGGCTGCCCAATCCGCATAATGCATCGAAGCGAATTACATGAAGTTGGCACGTTCATGGCCGTAAGAGATGTAGCCCCCTAGTATCAGTGCCCAACCTACCGTGGACCGTGAGGATTTAGA
>seq29
ATGATTAACAGGCATGCTCTTTGTGCGATTTACTAGAACTTTTGAACCATATTCAAGGAACCGATGGGATCCTTTTCACAAAACATTCCCAATGTCACATACAATATTGTAGCAAGGTTCTGAAGGTGTCCTAACGAGCCGCAAACTGGTACCTCAAGGAAGCTGTTGAACTCAGTCCGGACCGATCGGCGAGAAATCAACTTATAAGACACAGCACACTTATTTTGGTGACTGTGTCACCGATGATCCTGTGAAAATACTCTGACCTTGGTTGACAGGGGTATGGCTGCCAAGATATACTATGTATCTTCATAGAGGGTGAGCGCCGTGACCGGATTTAGTCCGGAGACACGGCATGATACATTTATTGGCAGGGCCTACCGGTAGTACTAGTCAAGGTTTAACCGTGGACCGTGAGGATT
>seq30
TGGCCCATCGTAACACCGACGAGCTCCACGGTTGTTTACCCTGCAACGTACACCCGAGTCGTGTGAAGGCCGGGGTCAGCCCCAAAATAACCTACCACATCTAACCCTCTCAACAAATTTTTTCGAGAAGTATTCTGATGTGACTCAAGCGGCACCCACTATGACAGCCCTAACGAATAACTTGATAATTCTCGTAAGCCTTAAGCGCCTCAGGTTCGGGACTCGGGCCCTATCGGGTGATTTTACACAGCGTCTCAGGGACGGAATACGGAAGGCCTATTTAGGCTCCATTATACTTCAACTGGCGGTTATCGTGATGAAAGAATCGTAAGAGCAAAGAAAAGATGACGGTCCTGTCACAACGCACCAACCGTGGACCGTGAGGATTT
>seq31
ACTATTGCCCTATCGTGGATCACGATGGAGACGCTCCCTAACTGGCCTCTTGTCAGCATTATCCCTAGTTCTTTCACCGAAAAACCACTGCCTGACTCTCTGCTTCTCTGAAATGAATATACCCAAGATAACTGTATTGCTGCTGAGACCGGAGTGACCCTACTTCTCTTCGCACTCTCAAGTCCTACAACCTATTTATCGCCAGCCAAGCAGCTGAGATGGGACTCGTAGAAAACGGACCTATACTCGTCGGCATCCCGCCTATGTTAGGCCGCGAAAATTATGACCCCTACCGCTATTCTATACAAAACGATATGTTGACCGAGAAGTAAGGCGTGGGTTGGTTCTACGCTACACCCGATACTAGATGGAACGATTTACGTATCCCCGTATCGAACGACGACAACGTGGCCAAACCGTCGACCGTTAGGATTG
>seq32
TCCAAATGGCGGTCCTAACCACGCTCACAGTATAATTAAGACTGCATGTGGGTTAGGTACCGCCCAAGGAAAACAATGCCGATGAGTCGGACAACTGTCGGCGCAAACCCTATGGATTGTGAAAAAATAGACTACATGATGCATGGTCATAGCAGGTGTCTTCCACCTCATTTGAGAGTTATGCAGCAGACTCGGTTTCACCCATTTTCCGACGCCCGTCCTTGTTTAGCTACAATCGCAGAACGACGTCCCGCGACTGCAGAGTGGTTGCCCAGAATATCGCTCATATGGTCAATGAAGCCGCGGACATGTGACCTGGAATGTCATAGCTAAGGTCTCGAACGGTCATATTAGAGGCATGGAACCGGTCCTTTATGGCGCGGTGCCAACCTTACACCGTGGACCGTGAGGATTC
>seq33
TCACCTGGTGTCAAGGGGCGTTGTGACGATCTACCTGTCCAACGTAAAATCCTCACAGCTAGTAATAATATATGAATTGCCATTGGAGTACGGAACCGTCCTCCTCTGTACACTGCTATCAGAGTTCCAATATTTAGTCGTAGTTCGACTGGGTGGGTATGAGTAGTACTGTCCAGGCCCTCCATGCAGCGAGTTCTCATTTCCGATTACCTGTCTTCTTTCGCTGTAAAGAATCCATTCCCGATCGTCATCGGATAACCTTCGCCAACGCAATCAGGAGTGGATACGAATAGAACTGTCAAGTGGACACTATTCACGTTCTGGCTCACATAACCTTCCACTTAGATTTTCTTATACCGTGGACCGTGAGGATTCC
>seq34
TAAGAGGGTTAGCGGTTGTTATTAGATGTCAGGCCAGCTTTCCGCCTCTGATCGGAATAGACAGCATTACCGCAGCCGGAAGTTTGTTGATCAGAATCCGAGGGTCGTGGACGATCCCTTCACTTTAGCACCCGACCGGTGGCCCCTACCCCAGCCCACGACGAATGCGAATGAAGGATAACAGGAAGCGTGTGTCGATAGCTCTCAGTTCCAGGGGGACACCGCCTAGGAACAAATCGTGCCATGGTATCGCACTTGCTAGAGTAACTACCCACGAGGAAATATGCATGCGTCCGACCTTTGTAGAAAACCTCTCTTTTAACTCACCGGAAATATAGAGCTCCTGTTGTTTTCAAGGCTAGTGAATGACGCGAAAATTCATGTATGCTAACCGTGGACCGTGAGGATTC
>seq35
GTGACTCAACCCGTATCGCCAGTTAGGTTCACCAAATTTTAGATCTGGAGTGCCAGTGCTGGTAAGCACACGTCAGTTTTGGGTAACCGAAGGCATTAAAAAAAGAAGAGTAGGGGTGCAATGGTCACAGTTTGAGATCTGGTCCGAGTGGACGAAATTGACAGTTGCCCAGCAAATCCGATGATACTTCGAAGGCGGGCCGTTCCAAGGCCGCTATCCAAGTACCTGTTGCGGCCCTGAGTTAGTATCTTGGCTTAATGACTCCCTTTAGTTAGTACAATCTAGGGCGCTGCTGGACTAGGTAGTCGGATTTAAGACCCGCGCGTGTGTGCGCTGCTTTTGTGAATACAGAGAAAATCAACCGCGCGGCCCGGTTAGATCCTATGGCCGACTTGTACGATGGATTGCCTTGCACGCTATTAATGAGACCCACGCCCCCCGGAATACATCCTCGTAGCATACCGTGGACCGTGAGGATTAT
>seq36
TCCGCACCTGTACCTCGGGCGGCTGTCTCCGTTCTGGGACACCACCCCCGCGGGTAATAATTAGATAGACACCAGACATACTACCGCGTGCTAGACGGAACCTAGATCAGGCCGCGACTACCCTCGCTCCCTAGGGCTGTTCCAGCGACCCATGATCCTGGGCTGGGCCTACCATATGTAACGTAAAACGCAGGACGGAGAGTCCCCTCGTCCCGGAAATCACTTCTACGCCATAATACGTTCAGCTAGAAATAGACCTGTTTAGGGCACTGTACTTCACCTATCGAGTAAGAGCAATCGAAGCGGGATCTTACAATCGAAAGCATCCTGCGGTTAGGCCAATAATTCCGCGAATCGGGAAGCTTCGACTTTCACCGGACTGGGAAAAAACCGCCACCGTCGACCGTTAGGATTG
>seq37
CAAATGTCCAGCTCACGAACCTCACCATTTGAAGGAAGAAAGGTCTACGAGACGAGACTGGGTGGACCTATCAGAATTGCCCGCGTGCTACTCTGCAGTGTTTTCAAGCTACGATTCGCTACCAAATTCAATTTGGAACTGCGCCGGTTGATGCTTCTATGCACTCCGTTCGTGTGGAGGAGAATGCACTCCTACGTTGTGGATGTTTTATATCCCAACGATGGTGGAATCGGCGCGTTACGTCCGTAGGAATTAGGGCGCGGTTACGCACATAGAGAATGATGATCTGGACACGATGTAAAGTATTCGGGGCAAGTATCATCCGTAAACGTTTTTCCTTCTGTCTCGCCCGGTACCGTCGACCGTTAGGATT
>seq38
TGCATAGCCACAAAAGAATCGGTCAATGAAATATACAGAGTACGTCTCTAGGGGACCGCGTCGCCTACCAATTTTCACGCTTCTGCCTGCCGATTAATCCGCAGAGCTTATAAACTTGGTGTGGGGGGGGTATTCTAATCTCACGCGTCAAAGCCCGTGTGCGGAACCCCAATGTACCTGAATAAATCAGCGCCGTTACCGGTAGGACTAAGTACAGTAGGCGTTTTACTAGGCAAAGCGGCGCTGGTGAGCAATAACGTGATAATCTCGTAGCAGGCGGATCCTAGATTGCATAAGTTGGTCACTGACACAGGTCGTCGCTTATAGGATGCGTATGTTGCACGGACCGAGCGTAGTTCGGCGTCTTGACACCGTGGACCGTGAGGATTGA
>seq39
GAGAAAGTTTAATCATGATTCTTTACACGAGGCAGGAATTCTAGTGGTCGCGACGCATCGAACCATTAAACTTTCCCGTGTCTCCTCAAGAGGCAGTCTGACAGCGCCCAAAAATCCTACCTCTAGTCATTACCATAATAGAAGGTAAGTCTGAGGCAAAGCCTGTTACAGGTCCGACAATACAACCCCCGAAGGCCTCCCCCATTCGAAAGTTGCGGATAGAACCATGGTACAGGAACGTACTTCGTCGATTCCACTAACAACAAAGAAAGTCATGTCAGTTTCCCTTGTCCAACTCGAATATTGGAAGCTAATGTATTATCGTAAACTATACTGTCCTGTAAACATAGGTGAGCTTCCCCGCACTAAAAGTCGCAGGTGACTTCTGAGCGAGGATGGGAGTACCGTGGACCGTGAGGATTCA
>seq40
TGATTTTTATGTAATACTCCGCTGCTTGGCACCGATGGTGAGAGTTACCTTTGCCACAGGGATGGCGCGTCGAAGCTACGTATGACAAGCGAACGCTTGCGTGGATCTTCCGATAATGAGCCATGTGAGAAATTGAGAGGGCTGAGCATACCTTCATTATCTGAACGTGTCTTCTTCAACCAAGCTGGGGGACGAGCGACTCGTGATTCCGACACTTGTCAAAAGTATGATTATGCGCTTGTTTCCGTTGCCCCGCAGAAGAGAGAATCTGGCCCGGTAAGATAGGTCGTGGTCGTCTTCGCAGAACGGGATTAGCAAGCACCACACATCGGCCCACGCGCTCTCTTACTGGGACCATCAATTTGACCGTCGACCGTTAGGATTAACC
>seq41
GCGGACACAATCTGACGTCTGCAAAGGGAGACGGTATTATTTTTTATACCGTTCGATAGGTTGAATGAGCGACGCGAGGTGACCTCCAGAGGATGACTCTGGGCGCTTTCCATAATCATTCTACGCTTCGCACCGCCATAGGCCTGTCCTCCGAAAGTGTTGTGTACAAGCGCAGTGCGTGCTCGGTAATCTTCGACACCGGGGTCCTGGCCGTGACTTGCCTATGCTATCCGCGTGATCCGCTCCGGGCGCACATACACGGGCCCTCCAGCCTGTCACTCGTTATAATCGACACTACTACCCGTTCCCCCTGAATGAACAACTTCGGTACCGGGGAATTCATTGATCGCCGCGGTTCCCGGCTGAACTGTTCCCCCAACCGTCGACCGTTAGGATTGGCGCTC
>seq42
GCCCATTTGCCGGCTTGGTCCCATACAGTCCGTCGGTTCCGTTTGAGTAACGGCACGTATTCACCGAAAGGACCTAGTCCGTGAAGATGACGCGGCCCTCTTGTCACGCAGTCTCCAAAGATGATTAACGACACCCTTTACCTGGCCGCCTGTTACGAGTCGACAAGTAAGGCCATAGTATGGGCCTTGGTGGGTGCTTTCGTCGGTCCCTAACATAGCTAGACGACTGCCGTGCAGACCGGACGATTCCGACCAAGTTCCGCCTATACAGTTCCAGCGTACAGATCCGAAGGTGGCGACCGGCCTGGGATCAAATACATGTACACTTAGTTTACCTGAAATTCTTGTTATGGGGTCTGGTAGCGAGGCGGTGCGCCCCACCGCTATCGTCGTCTCCGGGTTACAGTTGGGAGACCGTGGACCGTGAGGATT
>seq43
GAAACACTAAGGGAAGTCGTCGCGGTTTGACCTCCCAATGGAGGGTTTCCCGGCCTTATGAACGTTCTCAAGGGGCTAACTGGAAGCCACTTGTCTATGAGCGCCGCCCTGATATAGTCAATGCTGAACTGGGATCATCGATGACGGTATATTTACAGCCTAGAGGCCGACTGGGGCGGTAACTCGTAGGGTAAGTTTAGTTAAATCTCTCATAAGTAATAATAGAATTAGATCCTCCAAACGCTTGAAAATTGCATTGGGATGGCTGTAGAAGCATAGTATGATACTGCTAACTCAGGGCACTCGGAGGGTATGCGATGTCACAGACGTGAAGGACCAATCATAGACCGTGGACCGTGAGGATTTC
>seq44
ATGAGAAGCCTGATTACCGTTACGACACAGTGCGCCGGTGACGTGTAGCGTAATTTGATAGACAGTATGCTATAATCACTTAGACACGCATAATCATGGGGAACTACAGAATACCGGTGGATAAATTGTCACTAGAATTGTTGCTACCCTTTAGGCAGCGGTGAACCTGTTCCATCTCGGTACTCGCCTCTACGAAGATGAAGCCAGGATGTAATAGGTCGGGGTCGAGATGGGTCTGTAATCTATGCTGCAACAGTGAAAGGCACGGCCTAATGGTTGTAGAGCTCGGCATGACCGACGAGACTAGCACGAGGGTGTTGGGGTCTGGCCGCGCAGAAATTATCGGTGTCGCATCCCTAGTTAATTGATGTACAGACCGCCCGGCTCTTCCCCACGTTGCGTTGAATGCAATGCCTTCCATGTTCAAAATGAGTCTGCTCATTGAGCCCGTACCGTGGACCGTGAGGATT
>seq45
AATACGTCCAGTCGGCGAACGTGTAGGTAGAGCGGGGCCAAGGCCCTTTATGCTTGGCTTTCTCTTCAGCTCTATAACGTTGACCGAACCAATATCCCTTCGTAGGGTATATATAGACCTGGGCTTGATCGGAAGTATGGTGCAATAATGCCATTTTCTCTACAGACGCTGCACCGTGCGGACCGCGGTATTTCGATTTGACTCTAAGGAGACTCGTGCGCAACAGGGTTTTATTTCGTCATCCTTGTAACAATTAGTCCGGAGACGGCATCCAGGGACCCAGACTACCATGAGACCGCTACTCGGAGGCGCAGCCCCTTAATACTTTATAGGATGCCGATCAGAGCTCTGAAGTGCTCGTTGCAATACCGGCTTCAGACTTCGATGATACCGTCGACCGTTAGGATTGG
>seq46
CCCGCCTTTCCATGCGGCGCCCCGCTGTTCGGATTAAGAGTGATGTTCATGACGTAACCTCCATTCCAGGCCGGAATCACTTAGTACGCCGATCGTCGAGATAGTCTTAAGTAGCTACTGAACTGAGACATTAGTCGTCAAGGGAGATTTTGATTTGTATAAGCTGGTACAGACCATAGATCAATTAGTGCCTCTTTGGAGTTGACTGGCCTGTAGGTACGTCTATGTAGCCGAGTAGTGCACTATGCGCTTTATAGCTCCCAATAATTTGTAGATAATCATAGGTATGGTGTGGGGTCTCCTGCCCGATCTCCGAGATTCGCCGTAGAGTCTTACGTTTTGAGGTAGCGATGCAATAGGTGAAGTCGTATAACCGTCGACCGTTAGGATTTCATT
>seq47
CATTTAACGTGGGTCTCGCCGCAGGGCCGTCCTGCATAATTGCTCCCGCTCACGCTTGCAGTTATGTAGTTTGATCGCCGTATGCGCCGAAAGGGTCGCGTCCGTCATGGAGATGTCGCACTGAAGGGGAGGCCATCAATGAATTCCAAATCGTTGGGATAAAACACTTCCCGTAAAGATTCGGTACGGACACTCAAGTATTAACGCGCCAATCATGCCAGTGTGATGCAACTAAACCCCGTCTAAAGGTCCATACCATTGTACTCGAAGCTACCTAGGATGGCGCGAGATACTTAGGGGTTGCCCGGGGACACTCGGGGTCCGAGCCCTCCCTCTAGCAGGGGGTATAGACACCGTGGACCGTGAGGATTTTTAAC
>seq48
GCAAAGATATTCTCAAAGCTTAACTTCGTGTCGATACGTGTCTTTTAAGGAAGCCGTAGCACATTGCACTTCCGGTTCCGAACGAATATGCAAAGCTACTACGGCTAATAAATCCTCGTGATCTCATGATAACAACTAGATTGAGATCGGCTCCTAGCGCCACTACCGGCGATTGTCCTGGAGCCGGTCCGTATTGAACGGGCGACGTTCAGCAGTTGGCGATGCCAGATCAGCGCACCCGCGTAAACAACTGAGGTTACTGAAGTAGCCATAGTGCCCATATTGCTGTATCAGTGCTGGCCCCTACATACTTTCCTTATACAAAGTTTCGTTTATAGCTAGTATGCGCTGATACTACTTAGCCCACGGTTGATCACTGTCGCCGTGAGTCCCCGATTACCGTCGACCGTTAGGATTCGGT
>seq49
GTCTGGTTGAAGAATCCCGAGGCATGCTCGATAGCGGAAAACTTGTCCAGAGGCATCATGTCGGCTAACGCTGGGCTTCTCTCAGTCGGCCAGCCATTTTATATGCGGGTCCACGAGCGAGAAAGAAACATACAGCGGAGGAACGCTTTAGATGATAGATTGTTAGCGTGCCTATCTAACCGGTGCTTGGGCCTGATCGAAATTCAGAGCGAAACGGATCGGTTGGAAAAGCGGTTCGAGTACAGTACCTACACTACGGGGTGTACGCCCTCCCACGTGCTAAGGAAACCAGTCTCCCGTTTTAACAACCGTGATAAAATGACGGCGACAAGTTTTGTCTCAAACCCAAGCCTCGGCTGCATTGTAATAGCCCCGAAATCTAAGACTTAGGTGCCGGTCCTAGATCAGGGGGGAAAGTGGCTACCACCGTGGACCGTGAGGATTACTAA
Now to get to the problem at hand. How do we find the primer sequence in random sequence. The answer is with global alignment. If we align the first reverse primer to the first sequence, we get an skbio.Alignment
object back. If we print that object, we again get a fasta-formatted string. This lets us see how the sequences aligned to each other.
Notice that in this step we get an EfficencyWarning
. That's because scikit-bio currently only has a python implementation of global alignment, which is slow because it's a computationally complex algorithm. In the future, we'll have a C-based implementation which will be much faster.
In [4]:
aln = global_pairwise_align_nucleotide(reverse_primers[0], sequences[0])
print aln
>0
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ACCGTCGACCGTTAGGATT-----
>seq0
TATTTCAGTATAAGTTAGCGACTGATGTCATAGTCTAGCGACGAGGACATCCACTCCGGATATATCGCTACGCAGGCGCCACACGACCGGGTCTTGCAGAAACGATGATACCCAATGAGAAGTGTCCAAGCACCGCGCGGACAGAATCCGCTACGAACTTCGCTCACCCGGCTATCTTGATTAACATCGTATGCTGATCCCGAAGGCTTTGAGCCCAGCCCTTTTATTCACGTATGGGAAGTGGTACCAGACTCTGATCATTCAAAAGGACCCAGGCACCGGGGGTGAGCTTTCACGCTATGTGAGGTCTGACATGCTGTTAGAGTAATGGCTAACCGAGGCCTCTTAGCCACTAGATAAGAGTATACCGTGGACCGTGAGGATTTAGGT
/Users/caporaso/Dropbox/code/skbio/skbio/core/alignment/pairwise.py:531: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this).
"to track progress on this).", EfficiencyWarning)
We next want to find the start position of the primer sequence in the sequencing product, which we can do using the gap_vector
method of the first sequence in the alignment (to learn about gap_vector
, see its API documentation. The following tells us where the first non-gap character in the primer alignment is, which is the position in the sequencing product where the primer match begins.
In [6]:
gap_vector = aln[0].gap_vector()
primer_start_index = gap_vector.index(False)
print primer_start_index
366
So, we can slice the original sequence through that position, and the result will be our sequencing product minus the reverse primer and the non-biological sequence.
In [7]:
print sequences[0][:primer_start_index]
TATTTCAGTATAAGTTAGCGACTGATGTCATAGTCTAGCGACGAGGACATCCACTCCGGATATATCGCTACGCAGGCGCCACACGACCGGGTCTTGCAGAAACGATGATACCCAATGAGAAGTGTCCAAGCACCGCGCGGACAGAATCCGCTACGAACTTCGCTCACCCGGCTATCTTGATTAACATCGTATGCTGATCCCGAAGGCTTTGAGCCCAGCCCTTTTATTCACGTATGGGAAGTGGTACCAGACTCTGATCATTCAAAAGGACCCAGGCACCGGGGGTGAGCTTTCACGCTATGTGAGGTCTGACATGCTGTTAGAGTAATGGCTAACCGAGGCCTCTTAGCCACTAGATAAGAGTAT
Finally, if we want to do this for all of the sequences, we can embed the above steps in a loop over the SequenceCollection
.
In [8]:
trimmed_sequences = []
for sequence in sequences:
aln = global_pairwise_align_nucleotide(reverse_primers[0], sequence)
gap_vector = aln[0].gap_vector()
primer_start_index = gap_vector.index(False)
trimmed_sequences.append(DNA(sequence[:primer_start_index], sequence.id))
trimmed_sequences = SequenceCollection(trimmed_sequences)
We can then print the result, and we'll have acheived our goal.
In [9]:
print trimmed_sequences
>seq0
TATTTCAGTATAAGTTAGCGACTGATGTCATAGTCTAGCGACGAGGACATCCACTCCGGATATATCGCTACGCAGGCGCCACACGACCGGGTCTTGCAGAAACGATGATACCCAATGAGAAGTGTCCAAGCACCGCGCGGACAGAATCCGCTACGAACTTCGCTCACCCGGCTATCTTGATTAACATCGTATGCTGATCCCGAAGGCTTTGAGCCCAGCCCTTTTATTCACGTATGGGAAGTGGTACCAGACTCTGATCATTCAAAAGGACCCAGGCACCGGGGGTGAGCTTTCACGCTATGTGAGGTCTGACATGCTGTTAGAGTAATGGCTAACCGAGGCCTCTTAGCCACTAGATAAGAGTAT
>seq1
TGAAAATCAGTATGGACGTCAGCTGCCACTGTCATTGAATTATAGACGTGCGATCCCTTCTCGGTTGGCAGCTAGGTTTTTTTTGAAGGTCGTTTACGGGAGCGTCGCGTTTCATCACGCCGGCCACTAAGGACGACACGGATCCACGTGTTGGTAAGCCATATGGCATATGTGTGTTATGTGGGCACTTTGGGTGTCTGTCCTGCATCTTAAGTCCAACTAAAGGTCTGCTAACTGTATTGAGGGGACATAGAATTGTGTCCCGGGCACAGCGTCACATCTGTCGGACGTCTACTGTAAGGATTATGCAGACCCACTCGACGCTGCGTGCGTCAGCACACAGTGGAGGGGCAAGAGCATGATTAGGGCCGAGTGTAAATATCGTTAACCCATGTCGCGTATGGGTCTTTGTCTGGCTATTTAACCGGTCCCGGGCGTTCGCGCCGGTCTG
>seq2
GAGAGGGACGGAGTATTGTCAAATTACGTCGAATTACGCTATAACAGGCACAAATGCAGTCCGATTGCATGGTAACAATTTACACTTCTCAATTCCCCGGCGAGTTTCTTAACTATAACAACGCGGTTAAAGGTACCTGCTCGCCCATTGATGCGCATTGTATAAGCTACCTAGCGGACCCAGAATGTAAGAGTGCGTTCCGGGCCCCATATCCCGATTATTATCGTCCTCGCAGGCATGACGTCCGAAGTTCCCCGGCTACCTATGCCTCGGGCTGCGGTAGCCTTCGGAGCAATCACCGAACTCGAAGAGCAAGATGAAGAAACATGACCCACGCTGCTCAGAATCCGGGGAAGGTCGATTCCGGACCCCGGCGTATTCCAAGGGAATCAGATCATTAAGCAGAGGAAGAGCCGGGAGCGAATGTAGCCTTTGGTGTGGTTGGGACCATCCTGTCTTC
>seq3
GACGTGACCGCTACCTAACAGCCGTGGTCTAGGATGATCTCTACACGTCTGTGACTAGGATGCTACTAGGGTCAACAAATGCGTTTATGCGATGTGAACGTCATAGTGGAACGGTAATTACCTGATGGTACCGTAGTACAGTTATCCATTTCGGAGTCTAGGTGTAGAGCACACAATGGTTATCTAGGGTGTACGGCCATACCTTTAACTAAGTAGGAACGCGCCCATTGCGGTCCGGACGTGGGAACGAGAGGTCAAAAATTTAAGTAATCATAGGTCATGGAGCAATACCCAACTCAAGGGCAGTGAGCACTTTTGCGTCCCAAGTGCTCAATGTCCCGTTCTAGAGTACGCTGCAAGTCTATCAGAGCCGTGGGTGCGCTATAACGAACG
>seq4
GGGCGCAAGCCGTGCCCATTCGCGTCCCTCGTCACGGGGCGGTAATCTTGCTCAATCTCAGCAGCATTACATGCCCAATAAACGATCCCGCTGATCTGCTTGTCCCCACGACAGTACCCGAAATGTGTTGGGTCGATGACGTGGACAGTGCGGAATACAGATGCGGTTAGCGCTTTGTTTACAAACCCGATCAATAGTGATCGCTGTGTGGGGATCACTTACGATTTCACTCCAAATTGGTAGGGACGTATTTATTATACGGGGTCCTAGTCAGTCGCAGTGGGGGTGGACTTCAGCCGCTTATGACATGTGCACGCCCACGATTTGGGGGATAGCTGCGCCGTAGTGCGTTGGCCATGCCATAATTACGTTTTGAGCCAGGCC
>seq5
CGTCTTATTAGCAAACACCACGCAACCCTGAAAAGGTAAATCCGGCGTGCGACGGCCAGCCACTCCTGCAGCATGACGTTGCTCCAAGGTTAAAACGCGGCGCTGCACAGTGCTAAGCCCCGGCTTTGACCTTGTGGCGTCTACGTTATATTGTATCAAGAACCGTTCTCTGAGTTATTAGATAGGTGTGGAGACATGAACAAGTTGGGGTGAAATGGATGGTTTACCAAGGTGGCAAATAAGGATGGCGTGGAGACTTAATATTAGCAATCATCTCCGTAACCCGACAGAGCCCAGCCGAAGATTTGTTACGTTGTTACTGCGTGGCTAACCATGCACTACTTCCCTACGCTAAGAAAGAGTTAGCCCAAAAGAGCTATACGCTGTCTAGTAACGCCCGGGCTGCTGAAGTTGGCCTTATCAGTATCATTGATTTTATT
>seq6
TCCACGTGCGTTCGGGCGTTGACGGCATGCCAAATTTGGCTCAGTGCTAGTTACACCTAGTCAAAACCCTCATGGTAGAAGTTATACCCTTTTATGAGCTTCTGCCGCTCTGTTCTAGGAGCCCCGGGGTCTATAGCCGCCAGTAATTGCGGATATGTCTTGCGCATAACGCAATATGGCTATTTCGCACGCGCCGGCGACCATGCGCCGTTCATAACATGGGGAGATGCACAAAACCTATACCTAAGACTACCATGATAAAAGGATAATCAGAGTGGGGGATCAGCCTAACTGCTGTGGAATCAATTCTTTTAACACCCAGAAGCATGCCACTGTAGGATGGCGTACGTCGCGATGAA
>seq7
TGACGTCCTAATATGGTTAGCGTAGCCACTGCATAGAACGAGACATTCACTACCTGGTAACATAGGATGCACTTTTTTTTGGGTGGCTCATATTCATCTCGCCATGTTGTCATCGTCCACCGACCCATTGGAAGTGACATAGAAGCATTACGCTGATAAGTTGTTAACGGTCGCAAGGCCGGACCGGACTATAAGCTGACTATTAATAAAGTAACATAGGTCATTAATCGATATTCATCCGATCTTGTAACATACCGATATAAATTAGAACGGATTGCCCTCCGCTGCTATTTCGTCTTGGGCCTGGTAAGCTTTATGGGTACTCGGCGGACCGCAATTACGACCTAACCAGCGTGCACGAGCAGT
>seq8
GCCGAGAAGTTTGAGGCATCGGAGATAGTTTCACAGCAGCCGGGTCGCGATAGGTAACGAAAAGAACTTTTGGGTTACAGCAGTCTTTGAGGTAGGAGAGCAGGTACTGGCCGTTATTATGTCCGTAACCACCGTGGACGCTGTCATATACTGGATGCATCTCACTCTTAGGGAAACAAAGCCACCGCCCTCCTCTACGTAGAGAAGTGTTTGGGAGAATCCTGAGACCAGTAAGGGACATAAGACAATACGCGTGCACAGTCGTCCTGCTGTGTGAAACTTCCCGATATCTCGCCCTCGCGAGAAAGAGGTACAAAGCCATCCCGGTCAACGAGAGGGAGCTTCCCTGCAGACCCCCGAGACTTAGGAAACAAGACTTCGACGAAGGAGTAGGCTTTGCTGCGCGTTGAGCAGTCGCTACGCATTTCG
>seq9
AGATTACAACTAGAGCTCCGAAGGCACAAACCCACGTTGGCGATATTACTTTTAATCCACTGGCCCAAACGCAAGCTACCCTTCTGATACCCTTCCGCGCGCCGACGCTGGTATTGCAGAGCGCGAGAGCATTTGATGATCCGGACGAGGTAGCGTATGAGTTGATGGTACGTCTTGCGACGAAGATGTACATCGAATTTGCCTGTTAAGTAACCGGCTGGGGCGTGCCAAGTGTTCCGACCCTCAGGTGAGTCTACGCACGGCCGTTGGTGCTACCTCAGAAAACCTGATGCCATCTCGGATCGACACTAGCCCCAGTAGCCAGCTTAAAGATAATCTTAGTTCCAATTAGGGTCCATGATCAAGAAGTCCCATTATAAGTATCCATTGCAACTTCT
>seq10
CTGGTCTTACACTGTCTTGCGGATTCAGCTTGACCAGTCTTACGACGACTTCCCTTAATCCGTCCTATTTCACTTATGTACTGTTGAGACCGAGGCGAAAATAGGTGTCCAACTGAGGCCTGTCCAGTGGCAGACGGAGAATGTGACCGCCCCCTTGCCTGCTGTACACTAAAGATTCGATCGCCCGAAACAAGTGCATTCAATATCGCTACGGACATTGGAGTCGGAGAGGATCCGGAGGCGTATGGGGTATGGTACGTCCTCCCTGCGCAATTGAGCAAAGCGCCTGTAAAAAGACGTCTCTATAAGCCAGGTGGATAATCCACGGCAGTTAGTTTAACTTCAC
>seq11
AATGGCTGCGTCGGCATTACGCAGACATGGGTTACTATCACAATAGGTTCAAGCTTCCTTCGATAATATTGGTCGAATGCATGTCACCCCAAGCAGGGTGCACATCCTCTGATTATGTAGGTCACACGTACGTCTATGGTCCGGCAGTAATACTATTGCCTGTGTTAGAATACCTCTAACCCCGAGAGTCATAAGGCTCCCACTCGATGCCGATACTGCTCGGGACGAGATTTAGCATTTCTGATGTACCGATTACGAACAGAAAGTCAAGAACTAGTGACATACTATGCCTCTGTTACTACGAGGGAGATGCGTCCGGCACATTCGATGTATCGTATCAAGACCGTTGTTTGAGCTGAGGCGACCCTGAAAGATCCTCAACTAACATCCCATACAAATCCGGGTACAGCCTCAATCGTTGTTGCCAGACGTACAATACACACTTGAATGTATTA
>seq12
GCGTCGGTTACTGATCTGTTAGCCTCGTTGTTAATATGAGGAAGACGGGGACTTGCTGCTGCTCGAATTGGTTTGACGACAACCCATACCCTGCGGATCCAAGGTGCGTCCATACTCAGCTCTGGCCGGGGGACACAAACTTGATCTGCCCCGGAGATGTAGGAGTGGGCCGCGTGTTGCCGGGAAGTGACCTACACCCCTTGCCTCCGGCGCGTATCCACGGTCACGCTCGTGGCCGTCAAGTAGGGTATTTTTGGTCCCTTCGGATTCAGTGGCCACAGATGCCAAAACAGTGGGTGTGACTGAGACATCGCCGGCTCTCGTCTGACAGAGCGACGCACTACACTTTAGACTATCCCTGGCCTCCGGACGGTGCCACCGTTTGATAAATATAGGTCATTCTCGCCT
>seq13
CAACCCCTACGGCAATTCACACGCCGGGAGACGCACTCACGTCTTGGGGGTGAGGGAATCGCCTTTGCGCGACTTCTCGACGACAGGGCGCGAGCCTACGGATTACCGACGATGTCCAGGACCGAGTTAACGCAGCGGGCACACTAAATTGGGATTGGCTGCCCCTTGGGGAAGTGACGCCTGGTGCGTGGGAGCCGCTCGTCAAGGCCGCCGCGCTCATTGTTATCGGCACGGCGGAAATAGATAACCGCGAAATCTTGTTCCGCGTCCAATAAGGTTATCCTTCTCCCATCGGTGAACAGCTGACTTACTCTTACGCACTGGTAGTCACTTCGCTTTAACTACTTATAATAACAAGACATGGCCACCTTACTGTCGGACGCGGCCCATATTCTCGTCTCCTTAAAATTAACAGG
>seq14
CATTAGATATAAATGCTCCCCCTTAAGTTCAGCTCCATAGCCCCAGGGAATCATTTCGGAGTGTGTCGAATGGATGTACAATCCGCACTAGGTGACTACGCTCGTAATCTACCGTGAATGGATCAGATATCCTATACGTATGTCGTAAAGAACATAACTTGTGGAGTCACGTCGTAGTTGGCGAATCGTCTCACTTGACGAACGAGTAATCTTGGGGAGCGGCAGGCCTACAGCACGGGCGAACCTTCATTCGCACCGCCAGCCCCTACTCACTACTGTTATGCCAGAACATTTTATGAGCCCTCCCTGCCGTGCACAAGGATACAGGTACATGGAACGTCTCCGTGGGTGTTTGCGAAATGACCTGTCTGTGAGCTATCACAGCGGCTATGAAAACATTGAACGCGGAGGTGCCAGTGG
>seq15
AAGGTGCAACACTCACTACAGTGGTTACTTTAAGACTAGACCTGGCGCCGCATCTCTTTGCATCTCGGGCATATTGTTTCCGGGTCGGCGGTATGCTCCGTATCCTACTGCCACTGTAACTTTTTGAGCAGTGTGCTCCAAACGAGCAGGGTCGATCTGACATTTAGTGCTCATCCCAGGATGTGCATATAGACAGGACCAACTGCCGGGTGACTATGAGCTAGGTGGAACTAACTCCACCACTCGCCAGATGGAACAGCTAACCCTTTAGTACTCTTGCTTGACTACAGCAAGGTCCATTTTCTAAGGTTTGGGTGCATCGCAAATGCCAATAGCTACGTGCCCTATAGCCACTTCCTACTAGTTTATTGAGTGGTTTGTCATGCACGGATATACCCAGTGTGTTCCCTCCTTACCTGCTG
>seq16
AGATTACTGTGCTCAAAGTGAAGTCTCTGAAACAAGTAAGAATTGGAGACAAGAATTAGGTTAGGGCGTTTTAGTCATGAAGGCAAGGTAGCACGAGAATCCGGCTCACGGCCCTCACCAACCTCACTAGAGCCGTGGGTTATCGTCGGTCTAGAGAAAATTATGACTCTTTCAGACGACTGTTCAGAGAAGCCGGCCAATCTATGTCATAGACTAATTGTTTTATCTTTCACGCTTAGGTGCGTACCAGCTCTGAGAACATAAAGGCACGAAACGGTCAAAAGCCTTAGTTGTTGCACTGCAAGTAAATTGACGATAGTGACCCCCGAGCTAAGATTACTACTCGCAAATACCATGGATCTGACTCCAAGGATAGTCAGATCCCCCCGGCCGCCTTCCGAGAAAATTATAGCATGATGGATACAGAG
>seq17
TCTGAGTGCCTTTGTCTTGAATGCGAAGTCTGGCAGCATTCAGGTGGTGGATCTCACCGCCATCGGTACTGGGCGTATTTCTACTTCATGCGCGTTTGTGGGGGTGCCCTTCACGCATAATTAGCACGTCCCGCCCATCGGACGAAATTAGCTCCTGACGGGCCATTCTGCCAGGTTCCTTGGAGCCTCGCACTCGAGACAGGGGTATTGCCTGCCTAGTTTGGAATCGTGTTGAATTATGTTTAGAACAACTCCCCGTGCCTGACGCTGGGAGGGCTGAAAATCTCCGTCTGCTAATTCAGTGCTTATCACGCACTGGCTTCGGCATTTCACGGGGGCAGCAATCATGGCCGGCGACGTTGTTACATCGCTACCTATTATTAGGGTGCAGTTTAGCTGCACGAATAAA
>seq18
CCTCGCATTGTGATGCTCGCCCCGGACGCAGGCGAGCGGGTACCCTGAAATAGAAGCACAGACTCTCGCTTACTTATTTGCAAGCACGATCCTAGATAATTGGCACGTGTTTCGGTCAGGTTCTGTAGACAGAACGGTCGGGGTGGCTTGGGAGACCAGCCGACTATCGAGTAACAGTCAACTGAAGATTGTCCCCCCGGAACAGGGAATCCATTTAGTGGGTATGTGATCCAGACGTTTCGACTCCTATTCATGTTCCGACCCTGCTTGAAGTGCTAGGTCACGACGTATGACTATACGCATTCACCCGGACACTCGATGGGTCTATCGCTCGAGAACAAATTGAGTTGGCGGGATACGTGCCGAGCAGAAGCCCATACGATAGTTACTCGATGCTCCGATGCAGGTGCAACAT
>seq19
TCTCTTCGTACTAATCCCTAACCATGCACCGGAAGTCATACGTAGCAAATGACCTTTCAGTGCCCGATTATCGGTAACGCATAACTTCGAGGTTGCCGGCATCCCAGGCGGACCGGCAAAACAAGAAACAGCTGCGTACTACCATTTTTACGTTCCGAGCGGCATGATGGTAGCCCTGTGGAAATACAGCCCCGGACGGACTCCTTAATACGTCATGATTAATCGCGCGGTTTCTCCGCCTCCTCGACTGGTCCTCAAGCCTATAATCCGCCGACTGGAAAGTACCGTACGCCAGCAACGTAGCCTGTGGAAAATGTTTAGGTCAGTCGAAACACGTA
>seq20
ACAGACCCTCCGCCCAGCGTAGCTAACCAGCAATTAAAGTTTAGAGCGAGTGGGTATCAGGTTAAATCGGAGGCGCTAAAGTAAACTAAGGGTCCCTACGAAGGCGTTGGGGATTCGTTAGACGAGAGTCGCTGACTGCGCATAAGGTCCATCCCATCTTGAGTGGGTACACGACAATAAAATTAAGTTGTGGCTATGGGACGCGGCTCAAGAATGAGTGTAACCGTAGATCGGGAAACTTTTTTAACACGTACTGGCACCGAGGTTCCTAGTAGTTGACTAGTGGTTGGTAGGGGGCAAAAGACGCGCAGAATTGATCGCGTTTAAATTTGACTACAGAACCGGAGGGAACGTTCAGGTGTGCGAGGAAATGACAGTTTGAGTTTATAAGCCATATCGCACGAACCGCT
>seq21
GATACACACTTGTCCACGTTCAATCACACAGCTCAGCGGAGATACAGTCAAGTAGCGCGACCTGATGCTTCTATTTACGCGGGTGACAATCGTCATCAGATCCGAACCTCCTGCACGGATCTTTCAGCAGCAGTCTATCCTGTCGACGGCTCTATGACAGGCCGAGCTTCATCCGTTGGTTCACTAGTACCGTATGGGGCTCAGTCTGCAACCACTCCACCACACTAATAACCTTGAGTTGCTGCATGGGGGGGGGTCGCATAGCATTAAGACCCTGCGCTCACGTTACAGCTAGAAGTTCTCTCGAATTGCGGCAAAGCGAAGCCACTCTGCTGCATCAACTAACC
>seq22
GTAATGCGAAGTAACGTCGATAACGTGCTGTTAGCCAGTGTTCGAACGGCGTAGGGTCTGGGAGGGAGCTGCGTTAAAAATGGAGTGGGTACATATCAATATTAGACTGGGCTTTAGAAACGCTGTCTCGGTATACGGGGAACGGCAGTGAAGATAGGTCAGAAATGGTGGACACTACGGGCCTACGAACTACCTTAGTACACTATACGGCGGGACCAAAAGGCCCTTTTTAAACAGAAAGCACGCGGCAGTGTGCATATTCTTTGGAAGCGCCCTAATGGCCGACATTTCTGCGCGACAAGCGTAAAGACTTACCCCAGACAGAAAGCTCAATTGCTAATAAGAGCGACTGTTGCCAAGGCCCTGCACGTATACGGCGTATGCTCGACCCTGGAATGTCGAAATTCTAAT
>seq23
GATCAGTCAATCAATAGCATCAGAAGCTGTACCGGGACACATGGATATCTGATTAATGCCCCGTGCGGTCTTCTGTCAAGAGAAGTCCAACAGAACACCAAAATTGCCGTGCCACTGGCCCTAGGCGGCTAGTAGTAAGTACGAGAAGTCGATGACTCTCACTCACCGTGTTCCGGCAACATCATCCCTGAATTCCTATTATATGGACTCTCCATTGCACCATTCTTCAAAACCCTCGCCCACCCATATGGTCCTTAAGTGCGTCGCTGAACCGACCCAATTACACCTCTACTCAACGCAGTAACGTGCAGGTAACGGGTTAAATTTATATAGGAGGTCTAGAACGTA
>seq24
ACAGAGAGACAATTATAGCAGAAGAGTTTTTTAACTTTGTGATGTGTGAATATCATGCCGGGAACGACTGGGGAGAGTTTGTACGAAAAGTGTAGGAGATAGGGAGTACCACATCTAGACTTCCCTTTGTGAGCCGTAGCGTCTTACTAGTGTCACCTTTCGCAATGCCTGTGCCACCTAAATATTCGCAAACTGGTGATAGCGGGATCAAATTCGTGTAGCATAGCTTCTATGATTCAACTGACACAATCTAATTCAGCCTGCACTCCATCAGGGGTTTCTTGCGCCCTGGGCAGGCGTTTACCGGGGAGTAGGGCGACGAAGATACAGCGCCAGCTGCTGTTTCGCGCATGAGGCACGCAGCCCGACACAATATAGAGCGATGCAATGAGTAGTGCGGTGAA
>seq25
AGTGGATGGGACGCCGGATTCAGAGGAGGGTAAACCACGTGGCACAGCCATTAGAGGATCGGTAGCTTCGTCCACGTTCTTCAAACTAAGTACCATACAAATACACGTCATCCGCGGCTGTTGCGGCTAAGAGAAAACATTCTGCCCACATCGACACACACTAGGGAAACTGGGAAGTCCTTAAAACTACGTCTCAATGCGAGGCCCGTACCTCACGCTTCGAGCCTAATCGTCTAAGTAGTCGTATCAAAGTATGTCTAGTTCGGCACAAACCCTCGCAATCGCACGCCGTCAGTTCGGCCAGTGTGTTTTTCTCATATGCTGGGACTCTCCCCATTGTCAAAGGCTCAATATGTAGTAGAATCCCATTCCATGTCGCTGGGGCTTCTTGCGAAAGACGCGGTTGGCAGTGTGTTGGT
>seq26
CTATATGTTCTTCGAGAATTAAATTGCAACAACAGTCGCGCAGTCGTCGCCGGGAAGTTTTCGTTTACTTAGAATCTACAGGTATCTAGTAATCTGCAGGTATCCTACGTTTCTACCTTCCATCATCGGAATGCAAGAGCGTACAAGATCTTCCTACACCCTCCCTTAAGCTAGAATATAAAGCAGCACGTTAAGAAGATTCTAACACGGGTAAGAAGTGCAAGTCAGATCAGTGCCGACGGAGGTAGTTTTCGCAACAGGCCCACGCTATAAGGAAGCTTGGTCGCAGGACTACCACACATGGCTGTTGGAGAGCGGGTGTGGAAAGCACACACACCTCTCTGATAGGTCAAAAGGTTCACGGGGCGCGAAATTAATATTATTGAATATAGTCTACGAACCTAGAAACTTACCTAGCTACTTTCGACAGTCTTACAACCCGTATCTTTGTGTCAGGTTAAGCGTTGCTG
>seq27
CATCAGTAATCCCATCAGCCGGGGCGGACGTCTGGAGATCCTATGCGCAGCTCTGAATCTATTTGTGGGAGAACATCACAGAGAGATGAACCGTTCCGGATGAGTCAAAAGGACGAGTTTCCGAAGGGATAATACCGGATATCGACTGGGGTCGGCGATGCCCCTAGACGACACACAAGCCAAATCGCGCTCACAGATATCAATAGATAGAGCGTTGTAACCTACATCTCTGATTTCAAATAATAAGACGAACTCCAAGCGCCAGTCATGGAATCTTTAGGCAGCATGCGTGCCACTTGTTAATGGATGCCACCCTGAACCCGCGGGGGGGGCGCGGGAGCCACTCATACCCATCGCTGTAGACTTAATGGAGAAATGATTTCGAACTAAGCACGACCCTCGTAGGTGCTGCGATTCAGGATAGTACCCATCTTTCCCTCATTGA
>seq28
TCTAAGGCCCCGGATCCTGCGAACCATAACCGTGGCCGACAGATTTATGGAAACCTCCTTCACGTAATGGGCCAGCGACTCCCATTACAATCGCGTAAATGAATGTGCCTTTCTAAACCCGATCGCAGTTACAGTCCCATATATACTCCCAATTGGAATCGGAGCACTGACCTTGTTTTACGGTCCGGAGGTATCAATCTATCCTCGATTATTCCCGCACCCCCATAGCAGGCTGCCCAATCCGCATAATGCATCGAAGCGAATTACATGAAGTTGGCACGTTCATGGCCGTAAGAGATGTAGCCCCCTAGTATCAGTGCCCAACCT
>seq29
ATGATTAACAGGCATGCTCTTTGTGCGATTTACTAGAACTTTTGAACCATATTCAAGGAACCGATGGGATCCTTTTCACAAAACATTCCCAATGTCACATACAATATTGTAGCAAGGTTCTGAAGGTGTCCTAACGAGCCGCAAACTGGTACCTCAAGGAAGCTGTTGAACTCAGTCCGGACCGATCGGCGAGAAATCAACTTATAAGACACAGCACACTTATTTTGGTGACTGTGTCACCGATGATCCTGTGAAAATACTCTGACCTTGGTTGACAGGGGTATGGCTGCCAAGATATACTATGTATCTTCATAGAGGGTGAGCGCCGTGACCGGATTTAGTCCGGAGACACGGCATGATACATTTATTGGCAGGGCCTACCGGTAGTACTAGTCAAGGTTTA
>seq30
TGGCCCATCGTAACACCGACGAGCTCCACGGTTGTTTACCCTGCAACGTACACCCGAGTCGTGTGAAGGCCGGGGTCAGCCCCAAAATAACCTACCACATCTAACCCTCTCAACAAATTTTTTCGAGAAGTATTCTGATGTGACTCAAGCGGCACCCACTATGACAGCCCTAACGAATAACTTGATAATTCTCGTAAGCCTTAAGCGCCTCAGGTTCGGGACTCGGGCCCTATCGGGTGATTTTACACAGCGTCTCAGGGACGGAATACGGAAGGCCTATTTAGGCTCCATTATACTTCAACTGGCGGTTATCGTGATGAAAGAATCGTAAGAGCAAAGAAAAGATGACGGTCCTGTCACAACGCACCA
>seq31
ACTATTGCCCTATCGTGGATCACGATGGAGACGCTCCCTAACTGGCCTCTTGTCAGCATTATCCCTAGTTCTTTCACCGAAAAACCACTGCCTGACTCTCTGCTTCTCTGAAATGAATATACCCAAGATAACTGTATTGCTGCTGAGACCGGAGTGACCCTACTTCTCTTCGCACTCTCAAGTCCTACAACCTATTTATCGCCAGCCAAGCAGCTGAGATGGGACTCGTAGAAAACGGACCTATACTCGTCGGCATCCCGCCTATGTTAGGCCGCGAAAATTATGACCCCTACCGCTATTCTATACAAAACGATATGTTGACCGAGAAGTAAGGCGTGGGTTGGTTCTACGCTACACCCGATACTAGATGGAACGATTTACGTATCCCCGTATCGAACGACGACAACGTGGCCAA
>seq32
TCCAAATGGCGGTCCTAACCACGCTCACAGTATAATTAAGACTGCATGTGGGTTAGGTACCGCCCAAGGAAAACAATGCCGATGAGTCGGACAACTGTCGGCGCAAACCCTATGGATTGTGAAAAAATAGACTACATGATGCATGGTCATAGCAGGTGTCTTCCACCTCATTTGAGAGTTATGCAGCAGACTCGGTTTCACCCATTTTCCGACGCCCGTCCTTGTTTAGCTACAATCGCAGAACGACGTCCCGCGACTGCAGAGTGGTTGCCCAGAATATCGCTCATATGGTCAATGAAGCCGCGGACATGTGACCTGGAATGTCATAGCTAAGGTCTCGAACGGTCATATTAGAGGCATGGAACCGGTCCTTTATGGCGCGGTGCCAACCTTAC
>seq33
TCACCTGGTGTCAAGGGGCGTTGTGACGATCTACCTGTCCAACGTAAAATCCTCACAGCTAGTAATAATATATGAATTGCCATTGGAGTACGGAACCGTCCTCCTCTGTACACTGCTATCAGAGTTCCAATATTTAGTCGTAGTTCGACTGGGTGGGTATGAGTAGTACTGTCCAGGCCCTCCATGCAGCGAGTTCTCATTTCCGATTACCTGTCTTCTTTCGCTGTAAAGAATCCATTCCCGATCGTCATCGGATAACCTTCGCCAACGCAATCAGGAGTGGATACGAATAGAACTGTCAAGTGGACACTATTCACGTTCTGGCTCACATAACCTTCCACTTAGATTTTCTTAT
>seq34
TAAGAGGGTTAGCGGTTGTTATTAGATGTCAGGCCAGCTTTCCGCCTCTGATCGGAATAGACAGCATTACCGCAGCCGGAAGTTTGTTGATCAGAATCCGAGGGTCGTGGACGATCCCTTCACTTTAGCACCCGACCGGTGGCCCCTACCCCAGCCCACGACGAATGCGAATGAAGGATAACAGGAAGCGTGTGTCGATAGCTCTCAGTTCCAGGGGGACACCGCCTAGGAACAAATCGTGCCATGGTATCGCACTTGCTAGAGTAACTACCCACGAGGAAATATGCATGCGTCCGACCTTTGTAGAAAACCTCTCTTTTAACTCACCGGAAATATAGAGCTCCTGTTGTTTTCAAGGCTAGTGAATGACGCGAAAATTCATGTATGCTA
>seq35
GTGACTCAACCCGTATCGCCAGTTAGGTTCACCAAATTTTAGATCTGGAGTGCCAGTGCTGGTAAGCACACGTCAGTTTTGGGTAACCGAAGGCATTAAAAAAAGAAGAGTAGGGGTGCAATGGTCACAGTTTGAGATCTGGTCCGAGTGGACGAAATTGACAGTTGCCCAGCAAATCCGATGATACTTCGAAGGCGGGCCGTTCCAAGGCCGCTATCCAAGTACCTGTTGCGGCCCTGAGTTAGTATCTTGGCTTAATGACTCCCTTTAGTTAGTACAATCTAGGGCGCTGCTGGACTAGGTAGTCGGATTTAAGACCCGCGCGTGTGTGCGCTGCTTTTGTGAATACAGAGAAAATCAACCGCGCGGCCCGGTTAGATCCTATGGCCGACTTGTACGATGGATTGCCTTGCACGCTATTAATGAGACCCACGCCCCCCGGAATACATCCTCGTAGCAT
>seq36
TCCGCACCTGTACCTCGGGCGGCTGTCTCCGTTCTGGGACACCACCCCCGCGGGTAATAATTAGATAGACACCAGACATACTACCGCGTGCTAGACGGAACCTAGATCAGGCCGCGACTACCCTCGCTCCCTAGGGCTGTTCCAGCGACCCATGATCCTGGGCTGGGCCTACCATATGTAACGTAAAACGCAGGACGGAGAGTCCCCTCGTCCCGGAAATCACTTCTACGCCATAATACGTTCAGCTAGAAATAGACCTGTTTAGGGCACTGTACTTCACCTATCGAGTAAGAGCAATCGAAGCGGGATCTTACAATCGAAAGCATCCTGCGGTTAGGCCAATAATTCCGCGAATCGGGAAGCTTCGACTTTCACCGGACTGGGAAAAAACCGCC
>seq37
CAAATGTCCAGCTCACGAACCTCACCATTTGAAGGAAGAAAGGTCTACGAGACGAGACTGGGTGGACCTATCAGAATTGCCCGCGTGCTACTCTGCAGTGTTTTCAAGCTACGATTCGCTACCAAATTCAATTTGGAACTGCGCCGGTTGATGCTTCTATGCACTCCGTTCGTGTGGAGGAGAATGCACTCCTACGTTGTGGATGTTTTATATCCCAACGATGGTGGAATCGGCGCGTTACGTCCGTAGGAATTAGGGCGCGGTTACGCACATAGAGAATGATGATCTGGACACGATGTAAAGTATTCGGGGCAAGTATCATCCGTAAACGTTTTTCCTTCTGTCTCGCCCGGT
>seq38
TGCATAGCCACAAAAGAATCGGTCAATGAAATATACAGAGTACGTCTCTAGGGGACCGCGTCGCCTACCAATTTTCACGCTTCTGCCTGCCGATTAATCCGCAGAGCTTATAAACTTGGTGTGGGGGGGGTATTCTAATCTCACGCGTCAAAGCCCGTGTGCGGAACCCCAATGTACCTGAATAAATCAGCGCCGTTACCGGTAGGACTAAGTACAGTAGGCGTTTTACTAGGCAAAGCGGCGCTGGTGAGCAATAACGTGATAATCTCGTAGCAGGCGGATCCTAGATTGCATAAGTTGGTCACTGACACAGGTCGTCGCTTATAGGATGCGTATGTTGCACGGACCGAGCGTAGTTCGGCGTCTTGAC
>seq39
GAGAAAGTTTAATCATGATTCTTTACACGAGGCAGGAATTCTAGTGGTCGCGACGCATCGAACCATTAAACTTTCCCGTGTCTCCTCAAGAGGCAGTCTGACAGCGCCCAAAAATCCTACCTCTAGTCATTACCATAATAGAAGGTAAGTCTGAGGCAAAGCCTGTTACAGGTCCGACAATACAACCCCCGAAGGCCTCCCCCATTCGAAAGTTGCGGATAGAACCATGGTACAGGAACGTACTTCGTCGATTCCACTAACAACAAAGAAAGTCATGTCAGTTTCCCTTGTCCAACTCGAATATTGGAAGCTAATGTATTATCGTAAACTATACTGTCCTGTAAACATAGGTGAGCTTCCCCGCACTAAAAGTCGCAGGTGACTTCTGAGCGAGGATGGGAGT
>seq40
TGATTTTTATGTAATACTCCGCTGCTTGGCACCGATGGTGAGAGTTACCTTTGCCACAGGGATGGCGCGTCGAAGCTACGTATGACAAGCGAACGCTTGCGTGGATCTTCCGATAATGAGCCATGTGAGAAATTGAGAGGGCTGAGCATACCTTCATTATCTGAACGTGTCTTCTTCAACCAAGCTGGGGGACGAGCGACTCGTGATTCCGACACTTGTCAAAAGTATGATTATGCGCTTGTTTCCGTTGCCCCGCAGAAGAGAGAATCTGGCCCGGTAAGATAGGTCGTGGTCGTCTTCGCAGAACGGGATTAGCAAGCACCACACATCGGCCCACGCGCTCTCTTACTGGGACCATCAATTTG
>seq41
GCGGACACAATCTGACGTCTGCAAAGGGAGACGGTATTATTTTTTATACCGTTCGATAGGTTGAATGAGCGACGCGAGGTGACCTCCAGAGGATGACTCTGGGCGCTTTCCATAATCATTCTACGCTTCGCACCGCCATAGGCCTGTCCTCCGAAAGTGTTGTGTACAAGCGCAGTGCGTGCTCGGTAATCTTCGACACCGGGGTCCTGGCCGTGACTTGCCTATGCTATCCGCGTGATCCGCTCCGGGCGCACATACACGGGCCCTCCAGCCTGTCACTCGTTATAATCGACACTACTACCCGTTCCCCCTGAATGAACAACTTCGGTACCGGGGAATTCATTGATCGCCGCGGTTCCCGGCTGAACTGTTCCCCCA
>seq42
GCCCATTTGCCGGCTTGGTCCCATACAGTCCGTCGGTTCCGTTTGAGTAACGGCACGTATTCACCGAAAGGACCTAGTCCGTGAAGATGACGCGGCCCTCTTGTCACGCAGTCTCCAAAGATGATTAACGACACCCTTTACCTGGCCGCCTGTTACGAGTCGACAAGTAAGGCCATAGTATGGGCCTTGGTGGGTGCTTTCGTCGGTCCCTAACATAGCTAGACGACTGCCGTGCAGACCGGACGATTCCGACCAAGTTCCGCCTATACAGTTCCAGCGTACAGATCCGAAGGTGGCGACCGGCCTGGGATCAAATACATGTACACTTAGTTTACCTGAAATTCTTGTTATGGGGTCTGGTAGCGAGGCGGTGCGCCCCACCGCTATCGTCGTCTCCGGGTTACAGTTGGGAG
>seq43
GAAACACTAAGGGAAGTCGTCGCGGTTTGACCTCCCAATGGAGGGTTTCCCGGCCTTATGAACGTTCTCAAGGGGCTAACTGGAAGCCACTTGTCTATGAGCGCCGCCCTGATATAGTCAATGCTGAACTGGGATCATCGATGACGGTATATTTACAGCCTAGAGGCCGACTGGGGCGGTAACTCGTAGGGTAAGTTTAGTTAAATCTCTCATAAGTAATAATAGAATTAGATCCTCCAAACGCTTGAAAATTGCATTGGGATGGCTGTAGAAGCATAGTATGATACTGCTAACTCAGGGCACTCGGAGGGTATGCGATGTCACAGACGTGAAGGACCAATCATAG
>seq44
ATGAGAAGCCTGATTACCGTTACGACACAGTGCGCCGGTGACGTGTAGCGTAATTTGATAGACAGTATGCTATAATCACTTAGACACGCATAATCATGGGGAACTACAGAATACCGGTGGATAAATTGTCACTAGAATTGTTGCTACCCTTTAGGCAGCGGTGAACCTGTTCCATCTCGGTACTCGCCTCTACGAAGATGAAGCCAGGATGTAATAGGTCGGGGTCGAGATGGGTCTGTAATCTATGCTGCAACAGTGAAAGGCACGGCCTAATGGTTGTAGAGCTCGGCATGACCGACGAGACTAGCACGAGGGTGTTGGGGTCTGGCCGCGCAGAAATTATCGGTGTCGCATCCCTAGTTAATTGATGTACAGACCGCCCGGCTCTTCCCCACGTTGCGTTGAATGCAATGCCTTCCATGTTCAAAATGAGTCTGCTCATTGAGCCCGT
>seq45
AATACGTCCAGTCGGCGAACGTGTAGGTAGAGCGGGGCCAAGGCCCTTTATGCTTGGCTTTCTCTTCAGCTCTATAACGTTGACCGAACCAATATCCCTTCGTAGGGTATATATAGACCTGGGCTTGATCGGAAGTATGGTGCAATAATGCCATTTTCTCTACAGACGCTGCACCGTGCGGACCGCGGTATTTCGATTTGACTCTAAGGAGACTCGTGCGCAACAGGGTTTTATTTCGTCATCCTTGTAACAATTAGTCCGGAGACGGCATCCAGGGACCCAGACTACCATGAGACCGCTACTCGGAGGCGCAGCCCCTTAATACTTTATAGGATGCCGATCAGAGCTCTGAAGTGCTCGTTGCAATACCGGCTTCAGACTTCGATGAT
>seq46
CCCGCCTTTCCATGCGGCGCCCCGCTGTTCGGATTAAGAGTGATGTTCATGACGTAACCTCCATTCCAGGCCGGAATCACTTAGTACGCCGATCGTCGAGATAGTCTTAAGTAGCTACTGAACTGAGACATTAGTCGTCAAGGGAGATTTTGATTTGTATAAGCTGGTACAGACCATAGATCAATTAGTGCCTCTTTGGAGTTGACTGGCCTGTAGGTACGTCTATGTAGCCGAGTAGTGCACTATGCGCTTTATAGCTCCCAATAATTTGTAGATAATCATAGGTATGGTGTGGGGTCTCCTGCCCGATCTCCGAGATTCGCCGTAGAGTCTTACGTTTTGAGGTAGCGATGCAATAGGTGAAGTCGTATA
>seq47
CATTTAACGTGGGTCTCGCCGCAGGGCCGTCCTGCATAATTGCTCCCGCTCACGCTTGCAGTTATGTAGTTTGATCGCCGTATGCGCCGAAAGGGTCGCGTCCGTCATGGAGATGTCGCACTGAAGGGGAGGCCATCAATGAATTCCAAATCGTTGGGATAAAACACTTCCCGTAAAGATTCGGTACGGACACTCAAGTATTAACGCGCCAATCATGCCAGTGTGATGCAACTAAACCCCGTCTAAAGGTCCATACCATTGTACTCGAAGCTACCTAGGATGGCGCGAGATACTTAGGGGTTGCCCGGGGACACTCGGGGTCCGAGCCCTCCCTCTAGCAGGGGGTATAGAC
>seq48
GCAAAGATATTCTCAAAGCTTAACTTCGTGTCGATACGTGTCTTTTAAGGAAGCCGTAGCACATTGCACTTCCGGTTCCGAACGAATATGCAAAGCTACTACGGCTAATAAATCCTCGTGATCTCATGATAACAACTAGATTGAGATCGGCTCCTAGCGCCACTACCGGCGATTGTCCTGGAGCCGGTCCGTATTGAACGGGCGACGTTCAGCAGTTGGCGATGCCAGATCAGCGCACCCGCGTAAACAACTGAGGTTACTGAAGTAGCCATAGTGCCCATATTGCTGTATCAGTGCTGGCCCCTACATACTTTCCTTATACAAAGTTTCGTTTATAGCTAGTATGCGCTGATACTACTTAGCCCACGGTTGATCACTGTCGCCGTGAGTCCCCGATT
>seq49
GTCTGGTTGAAGAATCCCGAGGCATGCTCGATAGCGGAAAACTTGTCCAGAGGCATCATGTCGGCTAACGCTGGGCTTCTCTCAGTCGGCCAGCCATTTTATATGCGGGTCCACGAGCGAGAAAGAAACATACAGCGGAGGAACGCTTTAGATGATAGATTGTTAGCGTGCCTATCTAACCGGTGCTTGGGCCTGATCGAAATTCAGAGCGAAACGGATCGGTTGGAAAAGCGGTTCGAGTACAGTACCTACACTACGGGGTGTACGCCCTCCCACGTGCTAAGGAAACCAGTCTCCCGTTTTAACAACCGTGATAAAATGACGGCGACAAGTTTTGTCTCAAACCCAAGCCTCGGCTGCATTGTAATAGCCCCGAAATCTAAGACTTAGGTGCCGGTCCTAGATCAGGGGGGAAAGTGGCTACC
Content source: jairideout/scikit-bio-cookbook
Similar notebooks: